A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 100 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of the current customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.

The data on the 100 customers in the sample is as follows:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by health insurance plan (includes other dependents as well)
  • Smoker: Indicator if primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city with the default being Sanford

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

insurance <- read.csv("../CodingAssignment03/Insurance_Data_Group9.csv")
gt(head(insurance)) # the gt function only makes it look nicer
Charges Age BMI Female Children Smoker WinterSprings WinterPark Oviedo
7742.110 46 26.620 0 1 0 0 1 0
2026.974 21 39.490 1 0 0 0 1 0
2196.473 18 25.080 1 0 0 1 0 0
40720.551 46 30.495 0 3 1 0 0 0
9877.608 51 37.730 1 1 0 0 1 0
11987.168 55 33.880 0 3 0 0 1 0

Question 1

Randomly select 30 observations from the sample and exclude from all modeling. Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 70 observations.

set.seed(123457)
exclude <- sample(nrow(insurance), 30)
train <- insurance[-exclude,] # 70 observations used for modeling
test <- insurance[exclude,] # 30 withheld observations
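As a quick sanity check (an addition, not part of the original report), the split sizes can be verified:

```r
# Verify the split: 70 rows for modeling, 30 withheld, with no overlap
stopifnot(nrow(train) == 70, nrow(test) == 30)
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)
```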

train %>%
  select(Charges, Age, BMI, Children) %>%
  tbl_summary(statistic = list(all_continuous() ~ c("{mean}",       # Mean
                                                    "{sd}",         # Standard Deviation
                                                    "{median}",     # Median
                                                    "{min}",        # Minimum
                                                    "{max}"         # Maximum
                                                    )
                               ),
    type = all_continuous() ~ "continuous2" # Enhanced formatting for continuous variables
  )
Characteristic N = 70¹
Charges
    Mean 13,375
    SD 12,237
    Median 9,570
    Min 1,136
    Max 51,195
Age
    Mean 41
    SD 14
    Median 44
    Min 18
    Max 64
BMI
    Mean 31.1
    SD 5.7
    Median 30.8
    Min 16.0
    Max 47.7
Children
    0 27 (39%)
    1 15 (21%)
    2 16 (23%)
    3 9 (13%)
    4 2 (2.9%)
    5 1 (1.4%)
¹ n (%)

Question 2

Provide the correlation between all quantitative variables

cor(train[, c("Charges", "Age", "BMI", "Children")])
##            Charges       Age       BMI  Children
## Charges  1.0000000 0.2669529 0.2394231 0.2497586
## Age      0.2669529 1.0000000 0.2372201 0.2987257
## BMI      0.2394231 0.2372201 1.0000000 0.1781634
## Children 0.2497586 0.2987257 0.1781634 1.0000000

Question 3

Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?

allvar <- lm(Charges ~ Age + BMI + Female + Children + Smoker + WinterPark + WinterSprings + Oviedo, data = train)
summary(allvar)
## 
## Call:
## lm(formula = Charges ~ Age + BMI + Female + Children + Smoker + 
##     WinterPark + WinterSprings + Oviedo, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11292.8  -2542.3     18.9   3010.9  18364.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -15510.23    3872.49  -4.005 0.000171 ***
## Age              236.62      47.33   4.999 5.15e-06 ***
## BMI              457.04     117.63   3.886 0.000254 ***
## Female         -1722.18    1286.24  -1.339 0.185562    
## Children        1401.17     546.30   2.565 0.012799 *  
## Smoker         23987.60    1550.21  15.474  < 2e-16 ***
## WinterPark     -3948.67    1775.32  -2.224 0.029846 *  
## WinterSprings   -382.03    1950.68  -0.196 0.845382    
## Oviedo          -309.50    1913.52  -0.162 0.872042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5074 on 61 degrees of freedom
## Multiple R-squared:  0.848,  Adjusted R-squared:  0.8281 
## F-statistic: 42.54 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(allvar)

Yes, the plots above show some of the Gauss-Markov assumptions have been violated.

The ‘Residuals vs Fitted’ plot shows a curved pattern, indicating a non-linear relationship between the predictors and Charges. This violates assumption 3 of the Gauss-Markov theorem, which states that the independent variables are not correlated with the error term, and the curvature also points to specification bias.

The second plot, ‘Normal Q-Q’, checks whether the residuals are normally distributed for a fixed set of predictors. The points follow the 45-degree reference line fairly well through the middle, but several observations deviate above and below it in the tails. Strict normality is not required by the Gauss-Markov theorem itself, but the departure violates the sixth assumption of the classical linear regression model.

Finally, the third plot, ‘Scale-Location’, checks for homoskedasticity. If that assumption held, the points would scatter randomly around a horizontal line. Here the line is not horizontal: it trends downward and the residuals show a slight “fanning out” as the fitted values increase, indicating heteroskedasticity and violating assumption 4 of the Gauss-Markov theorem.

The last plot, ‘Residuals vs. Leverage’, flags regression outliers, influential observations, and high-leverage points. No observations fall outside Cook’s distance, so none appear unduly influential.
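The visual evidence for heteroskedasticity can also be checked formally. A sketch using the Breusch-Pagan test from the lmtest package (an addition, not part of the original output):

```r
library(lmtest) # provides bptest()

# Breusch-Pagan test: H0 is homoskedastic errors.
# A small p-value (< 0.05) would corroborate the Scale-Location plot.
bptest(allvar)
```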

The following solutions address the Gauss-Markov violations:

For non-linearity, you can transform the dependent or certain independent variables using log or square-root transformations. Given the wide spread of insurance charges, taking the log of Charges makes the relationship approximately linear and also stabilizes the variance, which addresses the heteroskedasticity.

You could also modify the model to include dummy variables that allow different intercepts for each group, which can represent the data more accurately and alleviate the issues caused by heteroskedasticity. Most insurers do not price on the exact number of children; plans typically come in two types, ‘single’ and ‘family’. Converting Children from a count into a dummy variable therefore categorizes the variable more sensibly.

Question 4

Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?

Data Transformation

train$lnCharges <- log(train$Charges) 
hist(train$lnCharges) #after for log Charges

summary(train$lnCharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.036   8.625   9.166   9.129   9.671  10.843
hist(train$Charges) #before

train$ChargesSquared <- train$Charges^2 # charges squared
hist(train$ChargesSquared) #after

summary(train$ChargesSquared)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.291e+06 3.165e+07 9.162e+07 3.265e+08 2.517e+08 2.621e+09
scatterplotMatrix(train[c(10,2:3,5)]) # lnCharges

par(mfrow=c(1,2)) # Charges side by side

As the histograms above show, taking the log of Charges pulls the outliers in toward the rest of the distribution. Untransformed, Charges is strongly right-skewed, and squaring Charges makes that skew drastically worse, which illustrates why squaring the dependent variable is inappropriate here. After the log transformation, the histogram shows that lnCharges is much closer to a normal distribution.

Data Transformation Continued

Below, we will take both the log and quadratic forms of multiple independent variables to determine better fits relative to the original linear form, which was found to be problematic.

train$ChildrenSquared <- train$Children^2 # Children squared
hist(train$ChildrenSquared) #after

summary(train$ChildrenSquared)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     1.0     3.1     4.0    25.0
train$lnChildren <- log(train$Children) #log of children
hist(train$lnChildren) #after

summary(train$lnChildren)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf    -Inf  0.0000    -Inf  0.6931  1.6094

Note that log(0) = -Inf, so lnChildren is undefined for customers with zero children and cannot be entered into a regression as-is.
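Since log(0) is -Inf, as the summary above shows, a common workaround (an assumption on my part, not used in the original analysis) is log1p(), which computes log(1 + x) and is finite at zero:

```r
# log1p(x) = log(1 + x): finite at x = 0, so every customer gets a usable value
train$lnChildren1p <- log1p(train$Children)
summary(train$lnChildren1p)
```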
scatterplotMatrix(train[c(10,2:3,12)]) # lnCharges with lnChildren

par(mfrow=c(1,2)) # Charts side by side
hist(train$Age) #before

train$lnAge <- log(train$Age) # log of age
hist(train$lnAge) #after

summary(train$lnAge)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.890   3.305   3.773   3.636   3.912   4.159
hist(train$Age) #before

train$AgeSquared <- train$Age^2 # age squared
hist(train$AgeSquared) #after

hist(train$BMI) #before

train$lnBMI <- log(train$BMI) #log of BMI
hist(train$lnBMI) #after

summary(train$lnBMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.770   3.288   3.427   3.419   3.531   3.866
train$BMISquared <- train$BMI^2 # quadratic of BMI
hist(train$BMISquared)  #after

summary(train$BMISquared)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   254.7   717.5   947.6   996.7  1167.9  2279.1

Modeling

Based on the transformation above, we will create models that reflect both quadratic and linear forms for multiple independent variables.

#Model 1
model_1 <- lm(lnCharges ~., data = train[,c(10,2:9)] ) #pulling only columns I want
summary(model_1)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7258 -0.1434  0.0042  0.1503  1.4266 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.572798   0.255273  25.748   <2e-16 ***
## Age            0.035493   0.003120  11.376   <2e-16 ***
## BMI            0.020083   0.007754   2.590   0.0120 *  
## Female         0.121214   0.084788   1.430   0.1579    
## Children       0.088441   0.036012   2.456   0.0169 *  
## Smoker         1.656425   0.102189  16.209   <2e-16 ***
## WinterSprings  0.007457   0.128588   0.058   0.9539    
## WinterPark    -0.126200   0.117029  -1.078   0.2851    
## Oviedo        -0.028646   0.126139  -0.227   0.8211    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3345 on 61 degrees of freedom
## Multiple R-squared:  0.8794, Adjusted R-squared:  0.8636 
## F-statistic:  55.6 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_1)

#ln charges
#Model 2
model_2 <- lm(lnCharges ~., data = train[,c(10,3:9,15)] ) #pulling only columns I want

summary(model_2)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 3:9, 15)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87652 -0.16920  0.02645  0.16491  1.41726 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.189e+00  2.696e-01  26.663  < 2e-16 ***
## BMI            2.127e-02  8.500e-03   2.503  0.01501 *  
## Female         1.304e-01  9.298e-02   1.402  0.16598    
## Children       1.065e-01  3.919e-02   2.718  0.00854 ** 
## Smoker         1.663e+00  1.122e-01  14.823  < 2e-16 ***
## WinterSprings  6.428e-03  1.410e-01   0.046  0.96380    
## WinterPark    -1.267e-01  1.284e-01  -0.987  0.32778    
## Oviedo        -2.027e-02  1.386e-01  -0.146  0.88417    
## AgeSquared     4.115e-04  4.172e-05   9.862 3.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3669 on 61 degrees of freedom
## Multiple R-squared:  0.8549, Adjusted R-squared:  0.8359 
## F-statistic: 44.92 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_2)

#lnCharges and AgeSquared
#Model 3
model_3 <- lm(lnCharges ~., data = train[,c(10,14,3:9)] ) #pulling only columns I want

summary(model_3)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 14, 3:9)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55977 -0.14355  0.02305  0.10310  1.37177 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.188562   0.404219   7.888 6.85e-11 ***
## lnAge          1.339429   0.106512  12.575  < 2e-16 ***
## BMI            0.019762   0.007223   2.736  0.00814 ** 
## Female         0.109889   0.079076   1.390  0.16968    
## Children       0.072828   0.033830   2.153  0.03530 *  
## Smoker         1.631300   0.095059  17.161  < 2e-16 ***
## WinterSprings  0.002113   0.119861   0.018  0.98599    
## WinterPark    -0.130848   0.109018  -1.200  0.23468    
## Oviedo        -0.040374   0.117440  -0.344  0.73219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3118 on 61 degrees of freedom
## Multiple R-squared:  0.8952, Adjusted R-squared:  0.8815 
## F-statistic: 65.13 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_3)

#lnCharges and lnAge
#Model 4
model_4 <- lm(lnCharges ~., data = train[,c(10,2,4:9,17)] ) #pulling only columns I want

summary(model_4)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 17)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70989 -0.16308  0.00825  0.14457  1.43884 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.886e+00  1.816e-01  37.926   <2e-16 ***
## Age            3.573e-02  3.106e-03  11.503   <2e-16 ***
## Female         1.227e-01  8.492e-02   1.445   0.1536    
## Children       8.775e-02  3.610e-02   2.431   0.0180 *  
## Smoker         1.656e+00  1.023e-01  16.188   <2e-16 ***
## WinterSprings  6.694e-05  1.289e-01   0.001   0.9996    
## WinterPark    -1.288e-01  1.176e-01  -1.096   0.2776    
## Oviedo        -2.688e-02  1.262e-01  -0.213   0.8321    
## BMISquared     3.049e-04  1.191e-04   2.561   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3349 on 61 degrees of freedom
## Multiple R-squared:  0.8791, Adjusted R-squared:  0.8633 
## F-statistic: 55.46 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_4)

#lnCharges and BMISquared
#Model 5
model_5 <- lm(lnCharges ~., data = train[,c(10,2,4:9,16)] ) #pulling only columns I want
summary(model_5)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2, 4:9, 16)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73920 -0.13134  0.00264  0.13730  1.41835 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.101250   0.784230   6.505 1.63e-08 ***
## Age            0.035289   0.003141  11.236  < 2e-16 ***
## Female         0.118779   0.084788   1.401   0.1663    
## Children       0.089731   0.035970   2.495   0.0153 *  
## Smoker         1.655747   0.102238  16.195  < 2e-16 ***
## WinterSprings  0.017437   0.128540   0.136   0.8925    
## WinterPark    -0.120176   0.116477  -1.032   0.3063    
## Oviedo        -0.029406   0.126237  -0.233   0.8166    
## lnBMI          0.613866   0.238249   2.577   0.0124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3346 on 61 degrees of freedom
## Multiple R-squared:  0.8793, Adjusted R-squared:  0.8634 
## F-statistic: 55.53 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_5)

#lnCharges and lnBMI
#Model 6
model_6 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,12,5)] ) #pulling only columns I want

summary(model_6)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 12, 
##     5)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70026 -0.14405 -0.00955  0.13481  1.37468 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.620186   0.259924  25.470  < 2e-16 ***
## Age              0.035077   0.003150  11.136 3.11e-16 ***
## BMI              0.018474   0.007929   2.330   0.0232 *  
## Female           0.132595   0.085614   1.549   0.1267    
## Smoker           1.650975   0.102378  16.126  < 2e-16 ***
## WinterSprings   -0.007636   0.129557  -0.059   0.9532    
## WinterPark      -0.137528   0.117643  -1.169   0.2470    
## Oviedo          -0.055094   0.129051  -0.427   0.6710    
## ChildrenSquared -0.024082   0.024630  -0.978   0.3321    
## Children         0.171712   0.092472   1.857   0.0682 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3346 on 60 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8635 
## F-statistic: 49.49 on 9 and 60 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_6)

#lnCharges and ChildrenSquared
#Model 7

#make children a dummy variable
train$childdummy <- 0 
train$childdummy[train$Children > 0] <- 1

model_7 <- lm(lnCharges ~., data = train[,c(10,2:4,6:9,18)] ) #pulling only columns I want
summary(model_7)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:4, 6:9, 18)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65227 -0.13093 -0.01566  0.13835  1.37804 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.642231   0.251498  26.411  < 2e-16 ***
## Age            0.035427   0.003043  11.644  < 2e-16 ***
## BMI            0.016919   0.007792   2.171  0.03380 *  
## Female         0.133797   0.083224   1.608  0.11307    
## Smoker         1.663931   0.100444  16.566  < 2e-16 ***
## WinterSprings -0.023576   0.121822  -0.194  0.84719    
## WinterPark    -0.161530   0.112590  -1.435  0.15649    
## Oviedo        -0.068542   0.123236  -0.556  0.58012    
## childdummy     0.263374   0.090093   2.923  0.00485 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3284 on 61 degrees of freedom
## Multiple R-squared:  0.8838, Adjusted R-squared:  0.8685 
## F-statistic: 57.97 on 8 and 61 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_7)

#dummy variable children
#Model 8
model_8 <- lm(lnCharges ~., data = train[,c(10,2:9,15)] ) #pulling only columns I want

summary(model_8)
## 
## Call:
## lm(formula = lnCharges ~ ., data = train[, c(10, 2:9, 15)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58721 -0.15275  0.02552  0.10953  1.38677 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.8158862  0.3887536  14.960  < 2e-16 ***
## Age            0.0799787  0.0179957   4.444 3.87e-05 ***
## BMI            0.0201024  0.0074383   2.703  0.00894 ** 
## Female         0.1124223  0.0814137   1.381  0.17244    
## Children       0.0727190  0.0351110   2.071  0.04266 *  
## Smoker         1.6373813  0.0983248  16.653  < 2e-16 ***
## WinterSprings  0.0070994  0.1233556   0.058  0.95430    
## WinterPark    -0.1373286  0.1123543  -1.222  0.22638    
## Oviedo        -0.0534739  0.1214107  -0.440  0.66120    
## AgeSquared    -0.0005500  0.0002194  -2.507  0.01490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3209 on 60 degrees of freedom
## Multiple R-squared:  0.8908, Adjusted R-squared:  0.8745 
## F-statistic:  54.4 on 9 and 60 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_8)

#lnCharges and Age + AgeSquared

Question 5

Use the 30 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)

bad_model <- lm(Charges ~., data = insurance) # baseline fit on the full sample of 100 (includes the withheld rows)
summary(bad_model)
## 
## Call:
## lm(formula = Charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12267.4  -2639.4   -112.3   2385.5  19603.5 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -13823.02    3448.79  -4.008 0.000125 ***
## Age              251.26      45.76   5.491 3.58e-07 ***
## BMI              394.14     100.36   3.927 0.000167 ***
## Female          -718.53    1189.05  -0.604 0.547154    
## Children        1075.41     534.90   2.010 0.047340 *  
## Smoker         24707.50    1343.92  18.385  < 2e-16 ***
## WinterSprings  -1338.49    1727.60  -0.775 0.440484    
## WinterPark     -3937.16    1578.52  -2.494 0.014429 *  
## Oviedo          -796.89    1769.67  -0.450 0.653561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5602 on 91 degrees of freedom
## Multiple R-squared:  0.8194, Adjusted R-squared:  0.8036 
## F-statistic: 51.62 on 8 and 91 DF,  p-value: < 2.2e-16
test$lnCharges <- log(test$Charges)
test$lnAge <- log(test$Age)
test$lnBMI <- log(test$BMI)
test$bad_model_pred <- predict(bad_model, newdata = test)


test$model_3_pred <- predict(model_3, newdata = test) %>% exp()
#lnAge

test$model_5_pred <- predict(model_5, newdata = test) %>% exp()
#lnBMI

# Finding the error

test$error_bm <- test$bad_model_pred - test$Charges

test$error_3 <- test$model_3_pred - test$Charges

test$error_5 <- test$model_5_pred - test$Charges

Bias

# Bad Model
mean(test$error_bm)
## [1] -381.4727
# Model 3
mean(test$error_3)
## [1] -1048.058
# Model 5
mean(test$error_5)
## [1] -1729.154

MAE

# Function to calculate MAE

mae <- function(error_vector){
  error_vector %>% 
  abs() %>% 
  mean()
}

# Bad Model
mae(test$error_bm)
## [1] 4582.452
# Model 3
mae(test$error_3)
## [1] 3813.108
# Model 5
mae(test$error_5)
## [1] 3933.514

RMSE

rmse <- function(error_vector){
   error_vector^2 %>% 
  mean() %>% 
  sqrt()

}

# Bad Model
rmse(test$error_bm)
## [1] 6428.387
# Model 3
rmse(test$error_3)
## [1] 6747.818
# Model 5
rmse(test$error_5)
## [1] 6804.108

MAPE

mape <- function(error_vector, actual_vector){
  (error_vector/actual_vector) %>% 
    abs() %>% 
    mean()
}

# Bad Model
mape(test$error_bm, test$Charges)
# Model 3
mape(test$error_3, test$Charges)
# Model 5
mape(test$error_5, test$Charges)

All things considered, Model 3 outperforms Model 5 on every measure reported here: it is less biased and has both the lower MAE and the lower RMSE of the two. Therefore, Model 3 (lnCharges with lnAge) would be the best model to use long-term.
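The four measures above can be wrapped into one hypothetical helper so each candidate model is scored identically (a sketch, not part of the original report):

```r
# Bias, MAE, RMSE, and MAPE for a vector of prediction errors
perf_measures <- function(error_vector, actual_vector) {
  c(bias = mean(error_vector),
    mae  = mean(abs(error_vector)),
    rmse = sqrt(mean(error_vector^2)),
    mape = mean(abs(error_vector / actual_vector)))
}

perf_measures(test$error_3, test$Charges) # Model 3's scorecard
```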

Question 6

Provide interpretations of the coefficients, do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.

-The summary regression output from question 5 shows that Age, BMI, Children, and Smoker all have a positive relationship with Charges, while Female, WinterSprings, WinterPark, and Oviedo all have a negative relationship with Charges (relative to the default city, Sanford). The signs make sense given the increased health risks associated with the variables that have a positive relationship with health expenses.

-For the variable Age, each additional year adds an expected $251.26 in medical expenses. For every one-point increase in BMI, there is an expected increase of $394.14 in medical expenses. If the customer is female, there is an expected $718.53 decrease in medical expenses compared to a male. Each additional child covered by the plan adds an expected $1,075.41 in medical expenses. If the customer is a smoker, there is an expected increase of $24,707.50 in medical expenses compared to a non-smoker. If the customer is located in Winter Springs, expected medical expenses are $1,338.49 lower than in the default location of Sanford; in Winter Park, they are $3,937.16 lower; and in Oviedo, $796.89 lower.
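The interpretations above apply to the level model (bad_model). For the log models, coefficients are approximate proportional effects; for a dummy variable the exact effect is 100 * (exp(b) - 1) percent. A sketch using model_7’s Smoker coefficient:

```r
# In a log-level model, a dummy coefficient b implies a
# 100 * (exp(b) - 1) percent change in expected Charges
b_smoker <- coef(model_7)["Smoker"]
100 * (exp(b_smoker) - 1) # smoking multiplies expected charges severalfold
```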

Question 7

An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.

Customer Age BMI Female Children Smoker City
1 60 22 1 0 0 Oviedo
2 40 30 0 1 0 Sanford
3 25 25 0 0 1 Winter Park
4 33 35 1 2 0 Winter Springs
5 45 27 1 3 0 Oviedo
#
name <- c("Customer", "Age", "BMI","Female","Children","Smoker","City")
Customer <- c("1","2","3","4","5")
Age <- c(60,40,25,33,45)
BMI <- c(22,30,25,35,27)
Female <- c(1,0,0,1,1)
childdummy <- c(0,1,0,1,1) # dummy: 1 if any children covered (the counts were 0,1,0,2,3)
Smoker <- c(0,0,1,0,0)
WinterSprings <- c(0,0,0,1,0)
WinterPark <- c(0,0,1,0,0)
Oviedo <- c(1,0,0,0,1)
Sanford <- c(0,0,0,0,0)

#
Potential_Clients <-data.frame(Customer,Age,BMI,Female,childdummy,Smoker,WinterSprings,WinterPark,Oviedo,Sanford)

#
predict(model_7, newdata = Potential_Clients,interval = "prediction")
##        fit      lwr       upr
## 1 9.205349 8.482255  9.928443
## 2 8.830276 8.143659  9.516892
## 3 9.453296 8.754262 10.152331
## 4 9.040476 8.297437  9.783515
## 5 9.548658 8.744717 10.352600
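Note that these intervals are on the log scale (lnCharges). To quote premiums in dollars, the bounds must be exponentiated; a sketch:

```r
# Back-transform the log-scale prediction intervals to dollar amounts
pi_log <- predict(model_7, newdata = Potential_Clients, interval = "prediction")
exp(pi_log) # fit, lwr, and upr columns are now in dollars
```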

Question 8

The owner notices that some of the predictions are wider than others, explain why.

-A prediction interval is wider than the corresponding confidence interval because it accounts for both the uncertainty in estimating the mean response and the random variation of individual observations.

-In this case, some prediction intervals are wider than others because clients whose characteristics sit farther from the averages of the training data carry more estimation uncertainty (higher leverage). For instance, customer 1 (the oldest client, with the lowest BMI) has a wider interval, 9.93 - 8.48 ≈ 1.45 on the log scale, than customer 2, whose values sit close to the sample averages, at 9.52 - 8.14 ≈ 1.37. This difference illustrates how the input values influence the width of the prediction intervals.

Question 9

Are there any prediction problems that occur with the five potential clients? If so, explain.

-Prediction problems often involve several types of errors, including extrapolation, model bias, high variance, and the effects of non-linearity and interactions. In this context, extrapolation error is unlikely to be a concern since the range of values for each independent variable in Model 7’s training data set adequately covers the range needed for the five clients in terms of age, BMI, number of children, and other factors. However, the model exhibits a negative bias, suggesting that predictions are likely to underestimate actual values.

-Another critical consideration is the sample size. Model 7’s training dataset includes only 30 observations, which is relatively small. A larger sample size, such as 100 or more, would provide a more accurate representation and improve the model’s reliability. Additionally, when using a model with a log-transformed dependent variable like lncharges, it is crucial to back-transform the results to reflect the true dollar amounts for accurate interpretation.